Experiments in the Retrieval of Unsegmented Japanese Text at the NTCIR-2 Workshop
Abstract
Our work with the Hopkins Automated Information Retriever for Combing Unstructured Text (HAIRCUT) system has made use of overlapping character n-grams in the indexing and retrieval of text. In previous experiments with Western European languages we have shown that longer n-grams (e.g., n=6) can provide an effective form of alinguistic term normalization. We wanted to investigate whether these methods could be adapted to processing unsegmented languages such as Japanese. To that end we participated in the Japanese and English portion of the NTCIR-2 evaluation. This paper describes results in monolingual Japanese and English retrieval and in cross-language retrieval using each language as a source language for the other. We found that 6-grams performed comparably with English words and that 2-grams and 3-grams performed equally well in Japanese text. A combination of runs using each tokenization method resulted in only a marginal improvement over runs using a single approach. These two trends were consistent regardless of query length or source language.
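As a rough illustration of the overlapping character n-grams mentioned in the abstract, the following Python sketch slides a window of n characters across a string one position at a time. It is not HAIRCUT's code: the function name `char_ngrams` is hypothetical, and choices such as whether n-grams span spaces or how text is normalized beforehand are assumptions the abstract does not address.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every overlapping character n-gram of text, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n one character at a time.
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Unsegmented Japanese: 2-grams need no word boundaries.
japanese_bigrams = char_ngrams("情報検索", 2)

# English: 6-grams over a single word.
english_sixgrams = char_ngrams("retrieval", 6)
```

Under these assumptions, `japanese_bigrams` holds the three bigrams 情報, 報検, and 検索, and `english_sixgrams` holds the four 6-grams of "retrieval". Because the window ignores word boundaries, the same routine applies to segmented and unsegmented text alike.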
Similar resources
Probabilistic Text Retrieval for NTCIR9 GeoTime
For the NTCIR-9 Workshop UC Berkeley participated only in the GeoTime track. For our initial experiments we used only the Logistic Regression ranking with blind feedback approach that we also used in NTCIR-8. We participated in both English and Japanese monolingual and bilingual search tasks. For all Japanese topics we preprocessed the text using the ChaSen morphological analyzer for term segme...
NTCIR Workshop: an Evaluation of Cross-Lingual Information Retrieval
This paper introduces the first NTCIR Workshop, Aug. 30 to Sept. 1, 1999, which is the first evaluation workshop designed to enhance research in Japanese text retrieval and cross-lingual information retrieval. The test collection used in the Workshop consists of more than 330,000 documents in English and Japanese. Twenty-three groups from four countries have conducted IR tasks and submitted the searc...
Berkeley at NTCIR-2: Chinese, Japanese, and English IR experiments
This paper reports on the work of the Berkeley group at the second NTCIR workshop on Japanese & English IR and Chinese IR. A number of runs were submitted on all subtasks in the two main tasks. Our main focus on the Japanese monolingual subtask was on comparing the retrieval effectiveness of different segmentation methods. The experimental results show the bigram indexing outperformed the word-base...
The NTCIR Workshop: the First Evaluation Workshop on Japanese Text Retrieval and Cross-Lingual Information Retrieval
This paper introduces the outline of the first NTCIR Workshop, which is the first evaluation workshop designed to enhance research in Japanese text retrieval and cross-lingual information retrieval. The test collection used in the Workshop consists of more than 330,000 documents, more than half of which are English-Japanese paired. Twenty-three groups from four countries have conducted IR tasks and s...
Preface of NTCIR-8
The NTCIR-8 Meeting is where the groups who actively participated in one or more tasks set by NTCIR-8 report their latest results obtained from the evaluation workshop. The NTCIR evaluation workshop series is designed to enhance research in information access technologies, including text retrieval, cross-language information access, question-answering, information extraction, text mining, etc....